REPORT LAYOUT:
- Introduction
- General Analysis and insights (try to find unique insights)
- Analysis of factors affecting revenue
- ML and prediction of select movies
In this report, we explore the factors that influence a movie’s monetary success at the box office. Our investigation extends beyond purely financial considerations to include factors such as the expertise of the cast and crew. By examining these components, the report aims to provide a comprehensive understanding of what defines a movie’s box-office success.
The data was obtained with our own web scraper and covers the 75 highest-grossing movies of each of the past 25 years.
Average revenue shows a distinct upward trend over time, with foreign revenue growing faster than domestic revenue. The surge in global revenue is primarily driven by this rapid expansion of foreign revenue, highlighting the growing acceptance of Western films in international markets.
The onset of the Covid-19 pandemic significantly impacted the film industry, as is evident in the graph. Productions were halted and theaters closed, leading to a substantial loss of earning potential. Lockdown measures worldwide disrupted filming schedules and postponed releases, while theater closures eliminated a crucial avenue for revenue. This had a ripple effect across the industry, affecting filmmakers, actors, crew members, distributors, and exhibitors. The industry’s vulnerability to external shocks became apparent, prompting adaptations such as online releases.
The month of release has a notable effect on revenue. Films released in May and June consistently outperform those released in other months, and an analysis of variance (ANOVA) shows a statistically significant difference in average revenue across release months. Several factors contribute to this phenomenon:
- Summer Blockbuster Season: May and June fall within the traditional summer movie season in numerous regions. Studios strategically unveil high-budget blockbuster films during this period, targeting a broad audience. The warmer weather and school vacations further boost movie attendance.
- Strategic Release Patterns: The film industry acknowledges this pattern, leading to a clustering effect. Recognizing the advantageous months, more popular and anticipated films tend to be strategically released during May and June. This intentional scheduling capitalizes on the observed heightened audience engagement during these months.
- Genre Preferences: Certain movie genres, such as action, adventure, and fantasy, are often associated with May and June releases as seen in the graph. These genres tend to draw larger audiences and generate higher revenue, contributing to the observed pattern. (Median Revenue used to account for outliers)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| Month | 11 | 8.114459e+18 | 7.376781e+17 | 10.76122 | 0 |
| Residuals | 1848 | 1.266797e+20 | 6.854964e+16 | NA | NA |
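As a rough illustration of the test summarized in the table above, the following minimal sketch runs a one-way ANOVA of revenue by release month in base R. The data frame `movies` and its values are synthetic and hypothetical; only the column names `Month` and `Worldwide` mirror the report, and the report's own table appears to come from a tidied version of the same kind of model.

```r
# Minimal sketch on synthetic data: one-way ANOVA of revenue by release month.
# `movies` is a hypothetical stand-in for the report's scraped data.
set.seed(42)

n_per_month <- 30
movies <- data.frame(
  Month = factor(rep(month.abb, each = n_per_month), levels = month.abb)
)

# Give May and June a higher baseline so the month effect is detectable.
baseline <- ifelse(movies$Month %in% c("May", "Jun"), 3e8, 2e8)
movies$Worldwide <- baseline + rnorm(nrow(movies), mean = 0, sd = 1e8)

# Fit the one-way ANOVA: does mean revenue differ across months?
fit <- aov(Worldwide ~ Month, data = movies)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Degrees of freedom: 12 months - 1 = 11 for Month, n - 12 for the residuals.
month_df <- anova_table[["Df"]][1]
resid_df <- anova_table[["Df"]][2]
p_value  <- anova_table[["Pr(>F)"]][1]
```

The same degrees-of-freedom logic applies to the report's table: 11 for Month and 1848 for the residuals together imply 1860 observations.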
Another notable observation is the seasonality in average revenue over the year. The seasonal strength of approximately 0.58 (on a scale from 0 to 1) indicates a substantial recurring pattern in our data set.
This strong seasonality implies that there are recurring trends or patterns in revenue that manifest on an annual basis. It suggests that certain times of the year consistently contribute to increased or decreased revenue. Understanding and leveraging this seasonality can be pivotal for strategic decision-making in the realm of film releases.
In practical terms, this finding prompts a closer examination of the temporal distribution of revenue throughout the year. A more detailed exploration of which months or seasons contribute significantly to high or low average revenues can unveil insights that may guide release strategies, marketing efforts, or resource allocation.
| trend_strength | seasonal_strength_year | seasonal_peak_year | seasonal_trough_year | spikiness | linearity | curvature | stl_e_acf1 | stl_e_acf10 |
|---|---|---|---|---|---|---|---|---|
| 0.5208086 | 0.5817723 | 5 | 8 | 8.034234e+27 | 1142997602 | -136943061 | -0.0645241 | 0.0550351 |
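To make the strength values in the table above more concrete, here is a minimal sketch of how trend and seasonal strength are commonly derived from an STL decomposition, using base R only. The series is synthetic, the month effects are hypothetical, and the decomposition settings are an assumption, so the numbers will not match the report's.

```r
# Minimal sketch on a synthetic monthly series; not the report's data.
set.seed(1)

n_years <- 25
month_effect <- c(0, 0, 0, 0, 3, 3, 1, -2, 0, 0, 1, 1)  # hypothetical yearly shape
revenue <- 10 + 0.02 * seq_len(12 * n_years) +          # slow upward trend
  rep(month_effect, times = n_years) +                  # recurring yearly pattern
  rnorm(12 * n_years, sd = 1)                           # noise

series <- ts(revenue, frequency = 12)

# Decompose into seasonal + trend + remainder with STL.
parts <- stl(series, s.window = "periodic")$time.series
seasonal  <- as.numeric(parts[, "seasonal"])
trend     <- as.numeric(parts[, "trend"])
remainder <- as.numeric(parts[, "remainder"])

# Strength = 1 - Var(remainder) / Var(component + remainder), floored at 0.
seasonal_strength <- max(0, 1 - var(remainder) / var(seasonal + remainder))
trend_strength    <- max(0, 1 - var(remainder) / var(trend + remainder))

# Position within the year (1-12) where the seasonal component peaks / bottoms out,
# assuming the series starts in January.
seasonal_peak   <- which.max(seasonal[1:12])
seasonal_trough <- which.min(seasonal[1:12])
```

A strength near 0 means the component is indistinguishable from noise, while a value near 1 means it dominates the remainder.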
| Genre | Avg Revenue |
|---|---|
| Sci-Fi | 390186395 |
| Adventure | 376108863 |
| Musical | 336913540 |
| Fantasy | 329471252 |
| Animation | 327635090 |
| Action | 318624319 |
| Family | 312315794 |
| Comedy | 230154236 |
| Thriller | 222654160 |
| Mystery | 217913536 |
cast <- df %>%
  dplyr::select(Title, Worldwide, Cast) %>%
  # One row per (movie, cast member)
  tidyr::separate_rows(Cast, sep = ", ") %>%
  # Average revenue and number of movies per cast member
  dplyr::group_by(Cast) %>%
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>%
  # Count is an integer, so '10-15' covers 11-15 and '15-20' covers 16-20
  dplyr::mutate(Movies_Acted = dplyr::case_when(
    Count >= 5 & Count <= 10 ~ '5-10',
    Count > 10 & Count <= 15 ~ '10-15',
    Count > 15 & Count <= 20 ~ '15-20',
    Count > 20 ~ '20+',
    TRUE ~ 'Less than 5'
  )) %>%
  # Average of the per-actor averages within each bin; Count is now actors per bin
  dplyr::group_by(Movies_Acted) %>%
  dplyr::summarise(AvgRev = mean(AvgRev),
                   Count = n()) %>%
  dplyr::arrange(desc(AvgRev))
cast
## # A tibble: 5 × 3
## Movies_Acted AvgRev Count
## <chr> <dbl> <int>
## 1 10-15 340892604. 52
## 2 20+ 325170485. 9
## 3 5-10 297994310. 291
## 4 15-20 291142359. 31
## 5 Less than 5 200239184. 2763
# Possible interpretation: audiences may be shifting toward newer faces or different kinds of storytelling, while still favoring established regulars.
range_order <- c("Less than 5", "5-10", "10-15", "15-20", "20+")
cast$Movies_Acted <- factor(cast$Movies_Acted, levels = range_order)
ggplot(cast, aes(x = Movies_Acted, y = AvgRev, fill = factor(Count))) +
  # geom_col() plots the values as given, like geom_bar(stat = "identity")
  geom_col(color = "black") +
  scale_fill_viridis_d() +
  labs(title = "Average Revenue by Number of Movies Acted",
       x = "Movies Acted",
       y = "Average Revenue",
       fill = "Count") +
  theme_minimal()
star <- df %>%
  dplyr::select(Worldwide, Star) %>%
  dplyr::group_by(Star) %>%
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>%
  # Keep only stars with at least 5 movies in the data set
  dplyr::filter(Count >= 5) %>%
  dplyr::arrange(desc(AvgRev))
star
## # A tibble: 93 × 3
## Star AvgRev Count
## <chr> <dbl> <int>
## 1 Robert Downey Jr. 1065872463. 11
## 2 Chris Pratt 942798923. 9
## 3 Tom Holland 894681963. 5
## 4 Daniel Radcliffe 873331103. 9
## 5 Elijah Wood 700348692. 5
## 6 Daniel Craig 584913297. 8
## 7 Jing Wu 579671044. 6
## 8 Tobey Maguire 548438356. 5
## 9 Kristen Stewart 542133565. 7
## 10 Chris Hemsworth 521315834 6
## # ℹ 83 more rows
star_plot <- star %>%
  ggplot(aes(x = Count, y = AvgRev, size = Count, color = Count,
             text = paste("Star:", Star,
                          "<br>Number of Movies:", Count,
                          "<br>Average Revenue:", scales::dollar(AvgRev)))) +
  geom_point() +
  labs(title = "Movie Stars and Avg Revenue",
       x = "Number of Movies",
       y = "Average Revenue",
       size = "Number of Movies")
plotly::ggplotly(star_plot, tooltip = "text")
#RUN REGRESSION ANALYSIS
writer <- df %>%
  dplyr::select(Title, Worldwide, Writer) %>%
  stats::na.omit() %>%
  # Writers are comma-separated, so the count is the number of commas plus one
  dplyr::mutate(Writer_Count = stringr::str_count(Writer, ",") + 1) %>%
  # Pool movies with 10 or more writers into a single group labelled 10
  dplyr::mutate(Grouped_Writer_Count = pmin(Writer_Count, 10)) %>%
  dplyr::group_by(Grouped_Writer_Count) %>%
  dplyr::summarise(AvgRevenue = mean(Worldwide),
                   Count = n())
writer
## # A tibble: 10 × 3
## Grouped_Writer_Count AvgRevenue Count
## <dbl> <dbl> <int>
## 1 1 192412071. 316
## 2 2 227805328. 434
## 3 3 210036030. 352
## 4 4 282824821. 255
## 5 5 301414854. 184
## 6 6 307590921. 118
## 7 7 369320972. 71
## 8 8 390427176. 37
## 9 9 363805457 39
## 10 10 477376779. 52
# FIND AND GRAPH THE TOP DIRECTORS
# DIRECTOR GENRE GRAPH
director <- df %>%
  dplyr::select(Title, Worldwide, Director) %>%
  dplyr::mutate(Director_Count = stringr::str_count(Director, ",") + 1) %>%
  dplyr::group_by(Director_Count) %>%
  # Note: despite its name, AvgRevenue holds the median here
  dplyr::summarise(AvgRevenue = median(Worldwide),
                   Count = n()) %>%
  # Drop director counts that occur in 2 or fewer movies
  dplyr::filter(Count > 2)
director
## # A tibble: 4 × 3
## Director_Count AvgRevenue Count
## <dbl> <dbl> <int>
## 1 1 162091208 1641
## 2 2 180513586 187
## 3 3 256786742 25
## 4 4 191439347 6
Building on the insights from the preceding analysis, we now turn to predictive modeling. The objective is to project the global box office revenue of upcoming, not-yet-released movies.
Here, our aim is to train a regression tree to forecast the global revenue of an upcoming release. The preceding analysis identified several influencing factors, which we incorporate to improve the model’s performance.
The model uses the following predictors: budget, distributor, release month, MPAA rating, runtime in minutes, primary genre, and the number of genres.
Before examining the tree, let’s look at how the model assigns importance to the predictor variables. Unsurprisingly, budget emerges as the top indicator of movie revenue, followed by release month, distributor, and the other variables. The ranking also suggests that the number of genres has some influence on revenue.
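The report's actual modeling code is not shown, so the following is only a minimal sketch of fitting a regression tree and reading off variable importance, using the `rpart` package on synthetic data. The data frame `movies`, the revenue formula, and the train/test split are all hypothetical, and only a subset of the predictors listed above is included.

```r
# Minimal sketch with rpart on synthetic data; not the report's workflow.
library(rpart)

set.seed(7)
n <- 500
movies <- data.frame(
  Budget       = runif(n, 1e6, 3e8),
  Month        = factor(sample(month.abb, n, replace = TRUE), levels = month.abb),
  Runtime      = round(runif(n, 80, 180)),
  count_genres = sample(1:4, n, replace = TRUE)
)
# Hypothetical revenue: driven mostly by budget, with a May/June bump and noise.
movies$Worldwide <- 2.5 * movies$Budget +
  ifelse(movies$Month %in% c("May", "Jun"), 1e8, 0) +
  rnorm(n, sd = 1e8)

# Hold out 25% of the rows as a test set.
test_idx <- sample(n, size = n / 4)
train <- movies[-test_idx, ]
test  <- movies[test_idx, ]

# method = "anova" requests a regression (not classification) tree.
tree <- rpart(Worldwide ~ Budget + Month + Runtime + count_genres,
              data = train, method = "anova")

# Importance scores, largest first.
importance <- sort(tree$variable.importance, decreasing = TRUE)
print(importance)

# Predicted worldwide revenue for the held-out movies.
predicted <- predict(tree, newdata = test)
```

In this toy setup budget dominates by construction; on real data the ranking depends on the tree that is actually grown.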
Below are the visualization of our decision tree and the model’s performance metrics on the test set.
| .metric | .estimator | .estimate |
|---|---|---|
| rmse | standard | 2.165122e+08 |
| rsq | standard | 2.827406e-01 |
| mae | standard | 1.328968e+08 |
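For readers unfamiliar with the three metrics in the table above, this minimal sketch computes them by hand on made-up numbers. The vectors `truth` and `estimate` are hypothetical, and R-squared is assumed here to be the squared correlation between actual and predicted values, which may differ from the definition the report's tooling uses.

```r
# Minimal sketch: RMSE, R-squared, and MAE computed by hand on made-up numbers.
truth    <- c(100, 250, 400, 800, 1200)  # hypothetical actual revenue (millions)
estimate <- c(150, 200, 450, 600, 900)   # hypothetical predictions (millions)

residuals <- truth - estimate

rmse <- sqrt(mean(residuals^2))  # penalizes large misses more heavily
mae  <- mean(abs(residuals))     # average absolute miss
rsq  <- cor(truth, estimate)^2   # squared correlation (assumed definition)

metrics <- data.frame(
  .metric   = c("rmse", "rsq", "mae"),
  .estimate = c(rmse, rsq, mae)
)
print(metrics)
```

RMSE is never smaller than MAE, and a wide gap between the two, as in the report's table, points to a few very large misses.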
The model’s performance metrics suggest suboptimal accuracy in predicting test set revenue. The mean absolute error (MAE) is around 133 million, signifying notable deviations from actual values. Additionally, the R-squared value indicates that our factors explain only about 28% of the variance in actual revenue. To explore the model’s limitations, we focus on outliers, identifying where the model struggles and could be improved. The plot below depicts the relationship between residuals and actual values, highlighting instances of significant prediction deviations.
The chart suggests outliers in predictions, particularly in high-revenue areas.
Now, let’s analyze the residuals, focusing on budget, the most influential factor. We set inclusive “close” and “bad” thresholds for the estimates. A boxplot by budget highlights where our estimates succeed (budgets around $50 million) and where they struggle (budgets above $150 million), possibly because higher-budget films are more complex and subject to additional influencing factors.
## Worldwide predicted residuals Budget Run Time (Mins) count_genres
## 1 0.0004430002 0.02012192 0.3905226 0.000465861 0.2108682 0.3619669
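The report does not state the "close" and "bad" thresholds described above, so the sketch below uses hypothetical cutoffs on relative error to show how such a labelling could work. The data frame `results` and all of its values are made up; only the column names follow the output above.

```r
# Minimal sketch: label each prediction "close", "ok", or "bad" by relative error,
# then compare budgets across labels. Thresholds and data are hypothetical.
results <- data.frame(
  Budget    = c(20e6, 50e6, 55e6, 150e6, 200e6, 250e6),
  Worldwide = c(60e6, 120e6, 130e6, 400e6, 900e6, 300e6),
  predicted = c(80e6, 125e6, 120e6, 380e6, 400e6, 700e6)
)

close_threshold <- 0.25  # within 25% of actual revenue (assumed cutoff)
bad_threshold   <- 0.50  # off by more than 50% (assumed cutoff)

results$residuals <- results$Worldwide - results$predicted
results$rel_error <- abs(results$residuals) / results$Worldwide

results$quality <- cut(
  results$rel_error,
  breaks = c(0, close_threshold, bad_threshold, Inf),
  labels = c("close", "ok", "bad"),
  include.lowest = TRUE
)

# Median budget per label; boxplot(Budget ~ quality, data = results)
# would give the visual counterpart.
median_budget <- tapply(results$Budget, results$quality, median)
print(median_budget)
```

Relative error is used here so that a $50 million miss counts differently for a small film than for a blockbuster; absolute thresholds would be an equally valid choice.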